-
Notifications
You must be signed in to change notification settings - Fork 533
fix(wal): ensure WAL offset monotonicity under S3 backpressure #3034
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Chandi Charan Mahato <[email protected]>
|
Hi maintainers 👋 This fix addresses issue #3033 (WAL offset monotonicity violation under S3 backpressure). I verified the behavior with:
Happy to make any adjustments, add tests, or help with discussions around concurrency safety. Thanks for your time! |
Both are guarded by writeLock. I haven't found the source of the BUG yet. Can you provide a detailed BUG process? |
|
Hi @superhx — thanks for the review. SummaryIssue: Under S3 upload backpressure, concurrent bulks can claim the same Reproduction (high level)
Minimal technical explanation & timelineBecause both threads read the same What I changed in this PR
Key property: the critical section is short (only offset arithmetic and assignment) — it does not perform IO or buffer allocation, so it does not hurt throughput but prevents offset collisions under backpressure. Why this is safe and correct
Observability (what to check after applying PR)
Example WAL inspection (what to look for)
Additional notes
Thanks. |
|
Hi @Chandi977 , thanks for your contribution! IIUC, the scenario below couldn't happen. Since we update the Would you mind sharing more details on how to reproduce this error at a low level? Thanks! |
|
Hi @Rancho-7 and @superhx , thanks again for taking the time to review this. You are absolutely right that The issue I observed is not with bulk offset allocation, but rather with when per-record offsets are assigned relative to asynchronous upload completion. In the current implementation, I’ve prepared a minimal reproducible scenario and test to illustrate this more clearly. 🔍 Reproduction ScenarioEnvironment:
This leads to recovered WAL objects appearing in this order: The scenario below reproduces this in a controlled test environment. 🧪 Test Case: slowS3_at_OFFSET_MONOTONICITY@Test
public void slowS3_at_OFFSET_MONOTONICITY() throws Exception {
BackpressureSimulatingS3 s3 = new BackpressureSimulatingS3(realS3);
ObjectWAL wal = new ObjectWAL(config, s3);
// Thread 1 - intentionally slow
Thread t1 = new Thread(() -> {
s3.setDelay(150); // induce artificial delay
wal.append(createRecord(200)); // baseOffset = 2000
}, "S3-slow-thread");
// Thread 2 - fast execution
Thread t2 = new Thread(() -> {
s3.setDelay(0);
wal.append(createRecord(200)); // baseOffset = 2048
}, "S3-fast-thread");
t1.start();
Thread.sleep(10); // allow t1 to allocate offset first
t2.start();
t1.join();
t2.join();
List<WALObject> objs = wal.objectList();
List<Long> offsets = objs.stream()
.map(WALObject::startOffset)
.collect(Collectors.toList());
for (int i = 1; i < offsets.size(); i++) {
if (offsets.get(i) < offsets.get(i - 1)) {
fail("Offset regression detected at index " + i + ": " + offsets);
}
}
}❗ Error Output (Terminal Log)📌 Root Cause Summary
This does not corrupt data but can produce non-monotonic WAL ordering, which can affect recovery flows that assume monotonicity. 🔧 Proposed Fix (Minimal & Safe)Instead of assigning offsets inside the async upload path, offsets are claimed up-front in a small synchronized section before IO: private long claimOffsetRange(List<Record> records) {
synchronized (offsetClaimLock) {
long current = nextOffset.get();
long start = current;
for (Record r : records) {
r.offset = current; // assign now
current += r.size;
}
nextOffset.set(current);
return start;
}
}
private void uploadBulk0(Bulk bulk) {
// Claim offsets before any asynchronous work
long firstOffset = claimOffsetRange(records);
for (Record record : records) {
ByteBuf header =
WALUtil.generateHeader(data, header, 0, record.offset);
// ...
}
// other logics
}Benefits
✔ After Fix – Test PassesThank you. |
Summary
This PR fixes a correctness issue in the S3 WAL implementation where multiple concurrent bulk uploads may reuse the same
baseOffsetwhen S3 backpressure or slow storage causes delayed uploads.As a result, two or more bulks can start from the same offset, producing duplicate WAL offsets and violating the monotonicity guarantees required for Kafka-compatible behavior.
The root cause is that
nextOffsetwas only advanced after bulk upload scheduling, allowing concurrent bulks to observe identical starting offsets.Fix
This PR introduces atomic offset claiming using a lightweight lock:
offsetClaimLockclaimOffsetRange()which:nextOffsetbefore any upload or buffer creation startsThe change is isolated and does not modify the bulk batching logic, only the correctness of offset assignment.
Changes Included
DefaultWriter.java:offsetClaimLockclaimOffsetRange()helperuploadBulk0()to use atomic offset assignmentendOffsetaligns correctly viaceilAlignOffsetReproduction (Before Fix)
docker pause minio).Overlapping or repeated offsets are produced during concurrent uploads.
Validation (After Fix)
Issue Reference
Fixes: #3033
WAL offset monotonicity violation under S3 backpressure
Impact
✔ Guarantees monotonic offset progression
✔ Prevents WAL corruption under concurrency
✔ Safe for existing deployments
✔ Zero performance regression (offset lock is lightweight)
Notes for Reviewers
This patch focuses solely on correctness.
No behavior changes in batching, uploads, or IO.
Only offset assignment is made deterministic and atomic.
Happy to adjust tests or add benchmarks if required.